ABSTRACT

The curse of dimensionality and the empty space phenomenon have emerged as
critical problems in text classification. One way of dealing with them is to
apply a feature selection technique before fitting a classification model.
This helps to reduce the time complexity and sometimes increases the
classification accuracy. This study introduces a feature selection technique
using K-Means clustering to overcome the weaknesses of traditional feature
selection techniques such as principal component analysis (PCA), which
require a lot of time to transform all the input data. The proposed technique
decides which features to retain based on the significance value of each
feature in a cluster. This study found that K-Means clustering helps to
increase the efficiency of the KNN model for a large data set, while the KNN
model without feature selection is suitable for a small data set. A
comparison between K-Means clustering and PCA as feature selection techniques
shows that the proposed technique is better than PCA, especially in terms of
computation time. Hence, K-Means clustering is found to be helpful in
reducing the data dimensionality with less time complexity than PCA, without
affecting the accuracy of the KNN model for high frequency data.
1. INTRODUCTION

With the rapid growth of the Internet, a huge amount of information can be obtained from different forms
of data. These data are provided through digital libraries, e-commerce websites, social networks, mobile
applications and other sources [1]. Currently, one of the major forms of data is unstructured text [2]. Unlike
structured text, these data are complex and not well organized, and they normally suffer from the curse of
dimensionality. A vector of word counts in a vector-space model of text documents may have more than
10,000 dimensions, and the available sample must be large enough to estimate a function of that many
variables with good accuracy. However, most high dimensional data are inherently sparse [3]. For instance,
a word may appear 100 times in one document but not at all in other documents. Hence, the data need to
undergo good pre-processing to obtain the best structure for use as input to prediction or classification
models.

One possible approach to simplifying high dimensional data is to apply some form of dimensionality
reduction [4]. This can be done in two ways: feature extraction or feature selection. In feature extraction,
the original vector space is transformed into a new vector space according to special characteristics.
Feature selection, on the other hand, keeps the most relevant variables from the original data set. Using
both techniques appropriately provides better data pre-processing [5]. Many researchers claim that principal
component analysis (PCA) is the most popular feature extraction method [6-8]. PCA is a classical statistical
technique that transforms the attributes of a data set into a new set of uncorrelated attributes known as
principal components. It is used to reduce the dimensionality while retaining as much of the variability of
the data set as possible [9].

PCA can increase efficiency because the classifier operates in a lower-dimensional space, but the
time required to pre-process the data increases tremendously. PCA is an unsupervised technique, which
makes no use of information related to the class variable. Another form of unsupervised technique is
clustering, and one of the best-known clustering techniques is k-means clustering. The simplicity and
efficiency of this algorithm make it useful for discovering the structure of data. Hence, this study proposes
an alternative method for reducing the dimensionality of the data: feature selection with k-means
clustering. A comparison between PCA and k-means clustering in selecting the features of high dimensional
data is provided in the study.


2. DATA BACKGROUND 

The study presents experimental results on two types of data sets: real and synthetic. Two real data
sets are used. Table 1 describes the two corpora selected for this study. The first data set was collected
from one of the major chain market online stores in Malaysia using prototype web scrapers developed under
the STATSBDA project known as Price Intelligence (PI) by the Department of Statistics Malaysia (DOSM).
A few leaf nodes were selected from the browse tree of the website to represent categories. The corpus
consists of product descriptions for four categories of baby products: baby diapers and wipes, baby milk
powder, baby food, and baby toiletries.


Table 1. Summary Description of Data Sets 


Data Set    Number of Categories    Number of Instances    Number of Features
Baby        4                       471                    387
SMS Spam    2                       5574                   6981


The second data set is a well-known collection that provides an ideal benchmark for evaluating
text classification models [10]. The SMS spam data set was originally collected from the Grumbletext
website, where cell phone users make public claims about SMS spam messages [11]. It consists of two
categories: ham and spam. The study also uses simulated data to compare K-Means clustering and PCA in
selecting features. Each class is composed of a number of normally distributed clusters. A normal
distribution with mean zero and standard deviation one is used to draw the useful independent features for
each cluster. The simulated data pose a two-class classification problem with sparse binary input features.
The data are generated with a hypercube data program [12] that is included in the scikit-learn library for
Python.
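The generator described above can be illustrated with a small standard-library sketch. The paper does not name the function it used; the description resembles scikit-learn's `make_classification`, but that mapping is an assumption. The function name `make_hypercube_data` and every parameter value below are hypothetical, and the sketch does not reproduce the sparse binary inputs mentioned in the text.

```python
import random

def make_hypercube_data(n_samples, n_features, n_informative=2,
                        clusters_per_class=2, class_sep=1.0, seed=0):
    """Toy two-class generator (illustrative only, not the study's code).

    Each class is a set of N(0, 1) clusters centred on distinct vertices of
    an n_informative-dimensional hypercube; the remaining features are
    pure N(0, 1) noise.
    """
    rng = random.Random(seed)
    n_clusters = 2 * clusters_per_class
    if n_clusters > 2 ** n_informative:
        raise ValueError("not enough hypercube vertices for the clusters")
    # Pick distinct vertices; each coordinate is +class_sep or -class_sep.
    vertex_ids = rng.sample(range(2 ** n_informative), n_clusters)
    centres = [[class_sep if (v >> b) & 1 else -class_sep
                for b in range(n_informative)] for v in vertex_ids]
    X, y = [], []
    for i in range(n_samples):
        cluster = i % n_clusters          # cycle through the clusters
        label = cluster % 2               # alternate clusters between classes
        informative = [rng.gauss(m, 1.0) for m in centres[cluster]]
        noise = [rng.gauss(0.0, 1.0)
                 for _ in range(n_features - n_informative)]
        X.append(informative + noise)
        y.append(label)
    return X, y

X, y = make_hypercube_data(n_samples=500, n_features=100)
print(len(X), len(X[0]), sorted(set(y)))
```

Varying `n_samples` and `n_features` in such a generator is one way to obtain the range of sample and feature counts needed for a scaling comparison.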


3. RESEARCH METHOD 

Text classification involves several steps. This study follows the basic steps of data extraction,
data pre-processing and feature extraction. Pre-processing itself consists of tokenization, stop word
removal and stemming [13]. This study uses a bag-of-words representation to extract the features before
performing feature selection to reduce the dimensionality of the data. All the procedures used in the study
are implemented in the R programming software. The software has been widely used to solve statistical
problems in various fields of study, including population growth [14], age prediction [15] and pattern
recognition [16].
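The pre-processing and bag-of-words steps can be sketched as follows. This is a minimal illustration in Python (the study itself used R), and it is not the authors' code: the stop-word list, the crude `toy_stem` suffix stripper and the two example documents are all hypothetical stand-ins for a real stop-word list, a real stemmer and the actual corpora.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "to", "is", "in"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def bag_of_words(documents):
    """Return the sorted vocabulary and one word-count vector per document."""
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["Baby wipes and baby diapers", "Milk powder for the baby"]
vocab, matrix = bag_of_words(docs)
print(vocab)    # one column per vocabulary term
print(matrix)   # one row of word counts per document
```

Even in this two-document example most entries of the count matrix are zero, which is the sparsity that motivates dimensionality reduction on real corpora.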


3.1. Data Dimensionality Reduction Techniques 
Two different feature selection approaches are used in the study: PCA and feature selection with
K-Means clustering.


3.1.1 Technique I: Principal Component Analysis (PCA) 

PCA is a linear method used to embed the data into a linear subspace of lower dimensionality. The
steps involved are shown in Figure 1. The method finds an orthogonal linear basis of reduced dimensionality
that retains the maximum amount of variance in the data. Mathematically, let $P$ be a data matrix with $N$
observations and $F$ features, and let $p_{(n)}$ represent the $n$-th row vector. The data are transformed
into the principal component space by $f_{j(n)} = w_j \cdot p_{(n)}$, where $w_j$ is the $F$-dimensional
loading vector and $f_{j(n)}$ is the $j$-th component score. The weight vector of the first principal
component, $w_1$, is found by

$$w_1 = \arg\max_{w} \frac{w^{T} P^{T} P w}{w^{T} w} \tag{1}$$

The next principal components can be obtained by subtracting the first $j-1$ components from the data,

$$\hat{P}_j = P - \sum_{m=1}^{j-1} P w_m w_m^{T} \tag{2}$$

and the loadings are calculated by

$$w_j = \arg\max_{w} \frac{w^{T} \hat{P}_j^{T} \hat{P}_j w}{w^{T} w} \tag{3}$$


Normally, the first few principal components account for the majority of the variance. However, the
number of principal components to include in the transformed data depends on the ability of the first j
principal components to provide full information about the actual data.


[Flowchart: Data Extraction -> Pre-Processing -> Feature Extraction -> Feature Selection (PCA) -> Classification]

Figure 1. Text classification with feature selection using PCA
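The maximization and deflation steps of PCA can be sketched numerically. The code below is an illustrative Python sketch, not the study's R implementation: it approximates the maximizer of the Rayleigh quotient by power iteration (a substitute for the eigenvalue decomposition a statistics package would normally use), and the data matrix and all function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(P):
    """Subtract each column mean (PCA is normally applied to centred data)."""
    n = len(P)
    means = [sum(col) / n for col in zip(*P)]
    return [[v - m for v, m in zip(row, means)] for row in P]

def leading_loading(P, iterations=200, seed=0):
    """Approximate arg max_w (w^T P^T P w) / (w^T w) by power iteration."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in P[0]]
    start_norm = math.sqrt(dot(w, w))
    w = [x / start_norm for x in w]
    for _ in range(iterations):
        scores = [dot(row, w) for row in P]          # P w
        v = [dot(col, scores) for col in zip(*P)]    # P^T (P w)
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:                              # no variance left
            break
        w = [x / norm for x in v]
    return w

def deflate(P, w):
    """Remove the component along w from every row: P - P w w^T."""
    out = []
    for row in P:
        score = dot(row, w)
        out.append([v - score * w_j for v, w_j in zip(row, w)])
    return out

def pca_loadings(P, n_components):
    """Repeated deflation; equivalent to subtracting all earlier components
    at once because the loading vectors are mutually orthogonal."""
    residual = center(P)
    loadings = []
    for _ in range(n_components):
        w = leading_loading(residual)
        loadings.append(w)
        residual = deflate(residual, w)
    return loadings

P = [[2.0, 1.9], [1.0, 1.2], [0.0, -0.1], [-1.0, -1.1], [-2.0, -1.9]]
w1, w2 = pca_loadings(P, 2)
scores_1 = [dot(row, w1) for row in center(P)]   # first component scores
print(w1, w2)
```

The sign of each loading vector is arbitrary, so only sign-independent properties (unit length, orthogonality, captured variance) should be compared across runs or implementations.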


3.1.2 Technique II: K-Means Clustering 

K-means clustering is a well-known algorithm that follows a gradient descent procedure [13]. The
features first undergo a first level of feature selection using correlation-based feature selection (CFS),
which filters the features before the clustering technique is applied. The steps involved are shown in
Figure 2. Given a data set of size $n$ with data points $x_1, x_2, \ldots, x_n$, where each data point is in
$\mathbb{R}^d$, the minimum variance clustering of the data set into $k$ clusters is obtained by finding the
$k$ points $\{m_c\}$ $(c = 1, 2, \ldots, k)$ in $\mathbb{R}^d$ such that

$$\frac{1}{n} \sum_{i=1}^{n} \left[ \min_{c} d^{2}(x_i, m_c) \right] \tag{4}$$

is minimized, where $d(x_i, m_c)$ denotes the Euclidean distance between $x_i$ and $m_c$. The technique
begins by randomly selecting the cluster centroids and iteratively updates these centroids to decrease the
objective function in (4). The algorithm keeps updating the cluster centroids until a local minimum is
found. After the desired clusters are obtained, the CFS technique is re-applied to the clustered data to
reduce the features in each cluster. The features from each cluster are then gathered together as the input
data for the classification model.

Indonesian J Elec Eng & Comp Sci, Vol. 16, No. 2, November 2019 : 752-758
Indonesian J Elec Eng & Comp Sci    ISSN: 2502-4752


[Flowchart: Data Extraction -> Pre-Processing -> Feature Extraction -> Feature Selection (CFS) -> Clustering Technique (K-Means Clustering) -> Clustered Data 1, Clustered Data 2, Clustered Data 3, ..., Clustered Data n -> Classification]

Figure 2. Text classification with feature selection using k-means clustering
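The clustering step of this pipeline can be sketched as a plain k-means loop. The code below is an illustrative Python sketch, not the study's R implementation: the toy points, the function names and the stopping rule (stop when no point changes cluster) are assumptions, and the CFS filtering applied before and after clustering is not shown because the paper does not give its details.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def objective(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(squared_distance(p, m) for m in centroids)
               for p in points) / len(points)

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing moved: local minimum
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep old centroid if cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment

points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = k_means(points, k=2)
print(labels, objective(points, centroids))
```

Because the result depends on the random initial centroids, the cluster numbering can differ between runs even when the grouping itself is the same.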


3.2. Classification Model 

Supervised machine learning models are widely used by researchers to solve classification problems
[17]. The classification model used in this study is the K-Nearest Neighbor (KNN) model, which is claimed
to be one of the most effective classification models in text mining [18-19]. KNN is an instance-based
learner in which the function is only approximated locally and all computation is deferred until
classification. During classification, each item is assigned to the class represented by the majority label of
its k nearest neighbors in the training data set [20]. This study uses the default nearest neighbor rule with
k equal to one. The generalized pseudocode for the KNN algorithm [21] is shown in Figure 3. The
performance measures used to evaluate the trained models are accuracy, precision, recall and F1-measure.
This study also measures the execution time of each classification model, because it is one of the important
results that can be measured in a study [22].


for all the unknown samples (p)
    for all the known samples (q)
        Compute the distance between p and q
    end for
    Find the k smallest distances and locate the
    corresponding known samples, q1, ..., qk
    Assign p to the class which appears
    most frequently among q1, ..., qk
end for





Figure 3. KNN Algorithm 
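The pseudocode in Figure 3 can be turned into a short runnable sketch. This Python version is illustrative only (the study used R); the function name `knn_predict` and the toy training points are hypothetical, and Euclidean distance is assumed because the pseudocode does not specify the metric.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training samples."""
    # Distance from the unknown sample to every known sample.
    distances = [(euclidean(x, query), label)
                 for x, label in zip(train_X, train_y)]
    # Keep the k smallest distances and their labels.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    # Ties are resolved in favour of the label seen first, i.e. the nearest.
    return votes.most_common(1)[0][0]

train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["ham", "ham", "spam", "spam"]

print(knn_predict(train_X, train_y, [5.0, 4.0], k=1))
print(knn_predict(train_X, train_y, [0.2, 0.4], k=3))
```

With k equal to one, as in this study, the vote reduces to copying the label of the single closest training sample. Every prediction scans the whole training set, which is why the number of features directly affects execution time.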


4. RESULTS AND DISCUSSION 

A feature selection technique is introduced to reduce data complexity before the classification
model is applied. This study found interesting outcomes on the usefulness of k-means clustering for
reducing the dimensionality of high frequency data sets. The performance evaluation for the two real data
sets is shown in Tables 2 and 3, respectively. Both data sets are partitioned into 70% training text data and
30% testing text data. Three KNN models are compared: no feature selection (KNN), feature selection using
K-Means clustering (KM-KNN) and feature selection using PCA (PCA-KNN).
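The evaluation protocol (a 70/30 split scored by accuracy, precision, recall and F1-measure) can be sketched as follows. This is a hedged Python illustration, not the study's R code: the helper names and the toy labels are hypothetical, and the metrics are computed for a single positive class. The paper does not state how precision, recall and F1 were averaged over classes, so the binary formulation here is an assumption.

```python
import random

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and split it into train and test parts."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

train, test = train_test_split(list(range(10)))
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
metrics = binary_metrics(y_true, y_pred, positive="spam")
print(len(train), len(test), metrics)
```

For a data set with more than two classes, such as the four baby product categories, these per-class values would still have to be combined by some averaging scheme.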


Table 2. Performance Measure for Baby Data Set 


Model      Accuracy (%)   Precision   Recall   F1-Measure   Execution Time (seconds)
KNN        97.18          0.9717      0.9767   0.9793       0.81
PCA-KNN    97.18          0.9717      0.9767   0.9793       1.56
KM-KNN     97.18          0.9717      0.9767   0.9793       1.06


Table 3. Performance Measure for SMS Spam Data Set 


Model      Accuracy (%)   Precision   Recall   F1-Measure   Execution Time (seconds)
KNN        95.16          0.9701      0.8026   0.7268       510.59
PCA-KNN    95.34          0.9748      0.8079   0.7322       1587.03
KM-KNN     95.52          0.9561      0.8280   0.7686       490.81


The performance of the three models is identical for the baby data set, although plain KNN runs
faster than the other two models. For the SMS spam data set, KM-KNN improves on the other models in
accuracy, recall and F1-measure, and it also consumes the least computation time. From the comparison,
the accuracy on a small data set does not appear to suffer when no feature selection technique is used.
However, these techniques seem to help increase model accuracy and efficiency for a large data set. This
study also found that PCA is able to reduce data dimensionality, but it requires a considerable amount of
time to transform the data before classification is performed. The performance of both feature selection
techniques was also examined on simulated data. The comparison between the three models is visualized
in Figure 4.


[Figure: four panels, one each for Number of Samples = 100, 500, 1000 and 5000]

Figure 4. Comparison of Accuracy and Execution Time between Three KNN Models for Simulation Data


The accuracy remains the same after applying feature selection techniques such as PCA and
K-Means clustering. A possible explanation is that a feature selection technique reduces the dimensionality
and eases the computation of the KNN model but does not influence the performance of the model. This
result is supported by previous studies, which claimed that computational complexity tends to be reduced
without affecting the performance of a classification model [23-24]. Hence, this study found that the
accuracy of the KNN model remains the same when feature selection is applied to normally distributed
data sets.

It is also apparent from Figure 4 that PCA requires a lot of time to transform the data as the
number of samples and features increases. This result is in line with previous studies that found PCA to be
at a disadvantage on large data sets, where a huge amount of time is required to perform the eigenvalue
decomposition that finds the principal components [9, 25]. Meanwhile, the execution time of KM-KNN
noticeably gets closer to that of the KNN model as the number of features increases from 100 to 10000.
This shows that K-Means clustering is useful for reducing the data dimensionality in less time for high
frequency data sets.


5. CONCLUSION 

This study focused mainly on evaluating the efficiency of the KNN model with feature selection
techniques. The most obvious finding to emerge is that k-means clustering helps to increase the efficiency
of the KNN model for a large data set. The study also identified that the KNN model without feature
selection is suitable for a small data set. The proposed feature selection technique using K-Means
clustering performs better than PCA, the existing well-known feature selection technique. The technique is
helpful because researchers often deal with a large number of features, especially in text mining.